12 research outputs found

    Gene Similarity-based Approaches for Determining Core-Genes of Chloroplasts

    Full text link
    In computational biology and bioinformatics, the manner to understand evolution processes within various related organisms paid a lot of attention these last decades. However, accurate methodologies are still needed to discover genes content evolution. In a previous work, two novel approaches based on sequence similarities and genes features have been proposed. More precisely, we proposed to use genes names, sequence similarities, or both, insured either from NCBI or from DOGMA annotation tools. Dogma has the advantage to be an up-to-date accurate automatic tool specifically designed for chloroplasts, whereas NCBI possesses high quality human curated genes (together with wrongly annotated ones). The key idea of the former proposal was to take the best from these two tools. However, the first proposal was limited by name variations and spelling errors on the NCBI side, leading to core trees of low quality. In this paper, these flaws are fixed by improving the comparison of NCBI and DOGMA results, and by relaxing constraints on gene names while adding a stage of post-validation on gene sequences. The two stages of similarity measures, on names and sequences, are thus proposed for sequence clustering. This improves results that can be obtained using either NCBI or DOGMA alone. Results obtained with this quality control test are further investigated and compared with previously released ones, on both computational and biological aspects, considering a set of 99 chloroplastic genomes.Comment: 4 pages, IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2014

    Finding the Core-Genes of Chloroplasts

    Full text link
    Due to the recent evolution of sequencing techniques, the number of available genomes is rising steadily, leading to the possibility to make large scale genomic comparison between sets of close species. An interesting question to answer is: what is the common functionality genes of a collection of species, or conversely, to determine what is specific to a given species when compared to other ones belonging in the same genus, family, etc. Investigating such problem means to find both core and pan genomes of a collection of species, \textit{i.e.}, genes in common to all the species vs. the set of all genes in all species under consideration. However, obtaining trustworthy core and pan genomes is not an easy task, leading to a large amount of computation, and requiring a rigorous methodology. Surprisingly, as far as we know, this methodology in finding core and pan genomes has not really been deeply investigated. This research work tries to fill this gap by focusing only on chloroplastic genomes, whose reasonable sizes allow a deep study. To achieve this goal, a collection of 99 chloroplasts are considered in this article. Two methodologies have been investigated, respectively based on sequence similarities and genes names taken from annotation tools. The obtained results will finally be evaluated in terms of biological relevance

    Combinaison d'approches pour résoudre le problème du réarrangement de génomes

    No full text
    In Bioinformatics, understanding how DNA molecules have evolved over time remains an open and complex problem.Algorithms have been proposed to solve this problem, but they are limited either to the evolution of a given character (forexample, a specific nucleotide), or conversely focus on large nuclear genomes (several billion base pairs ), the latter havingknown multiple recombination events - the problem is NP complete when you consider the set of all possible operationson these sequences, no solution exists at present. In this thesis, we tackle the problem of reconstruction of ancestral DNAsequences by focusing on the nucleotide chains of intermediate size, and have experienced relatively little recombinationover time: chloroplast genomes. We show that at this level the problem of the reconstruction of ancestors can be resolved,even when you consider the set of all complete chloroplast genomes currently available. We focus specifically on the orderand ancestral gene content, as well as the technical problems this raises reconstruction in the case of chloroplasts. Weshow how to obtain a prediction of the coding sequences of a quality such as to allow said reconstruction and how toobtain a phylogenetic tree in agreement with the largest number of genes, on which we can then support our back in time- the latter being finalized. These methods, combining the use of tools already available (the quality of which has beenassessed) in high performance computing, artificial intelligence and bio-statistics were applied to a collection of more than450 chloroplast genomes.En bio-informatique, comprendre comment les molécules d’ADN ont évolué au cours du temps reste un problème ouvert etcomplexe. Des algorithmes ont été proposés pour résoudre ce problème, mais ils se limitent soit à l’évolution d’un caractèredonné (par exemple, un nucléotide précis), ou se focalisent a contrario sur de gros génomes nucléaires (plusieurs milliardsde paires de base), ces derniers ayant connus de multiples événements de recombinaison – le problème étant NP completquand on considère l’ensemble de toutes les opérations possibles sur ces séquences, aucune solution n’existe à l’heureactuelle. Dans cette thèse, nous nous attaquons au problème de reconstruction des séquences ADN ancestrales en nousfocalisant sur des chaînes nucléotidiques de taille intermédiaire, et ayant connu assez peu de recombinaison au coursdu temps : les génomes de chloroplastes. Nous montrons qu’à cette échelle le problème de la reconstruction d’ancêtrespeut être résolu, même quand on considère l’ensemble de tous les génomes chloroplastiques complets actuellementdisponibles. Nous nous concentrons plus précisément sur l’ordre et le contenu ancestral en gènes, ainsi que sur lesproblèmes techniques que cette reconstruction soulève dans le cas des chloroplastes. Nous montrons comment obtenirune prédiction des séquences codantes d’une qualité telle qu’elle permette ladite reconstruction, puis comment obtenir unarbre phylogénétique en accord avec le plus grand nombre possible de gènes, sur lesquels nous pouvons ensuite appuyernotre remontée dans le temps – cette dernière étant en cours de finalisation. Ces méthodes, combinant l’utilisation d’outilsdéjà disponibles (dont la qualité a été évaluée) à du calcul haute performance, de l’intelligence artificielle et de la biostatistique,ont été appliquées à une collection de plus de 450 génomes chloroplastiques

    On the reconstruction of the ancestral bacterial genomes in genus Mycobacterium and Brucella

    No full text
    International audienceTo reconstruct the evolution history of DNA sequences, novel models of increasing complexity regarding the number of free parameters taken into account in the sequence evolution, as well as faster and more accurate algorithms, and statistical and computational methods, are needed. More particularly, as the principal forces that have led to major structural changes are genome rearrangements (such as translocations, fusions, and so on), understanding their underlying mechanisms, among other things via the ancestral genome reconstruction, are essential. In this problem, since finding the ancestral genomes that minimize the number of rearrangements in a phylogenetic tree is known to be NP-hard for three or more genomes, heuristics are commonly chosen to obtain approximations of the exact solution. The aim of this work is to show that another path is possible

    Ancestral Reconstruction and Investigations of Genomic Recombination on some Pentapetalae Chloroplasts

    No full text
    International audienceIn this article, we propose a semi-automated method to rebuild genome ancestors of chloroplasts by taking into account gene duplication. Two methods have been used in order to achieve this work: a naked eye investigation using homemade scripts, whose results are considered as a basis of knowledge, and a dynamic programming based approach similar to Needleman-Wunsch. The latter fundamentally uses the Gestalt pattern matching method of sequence matcher to evaluate the occurrences probability of each gene in the last common ancestor of two given genomes. The two approaches have been applied on chloroplastic genomes from Apiales, Asterales, and Fabids orders, the latter belonging to Pentapetalae group. We found that Apiales species do not undergo indels, while they occur in the Asterales and Fabids orders. A series of experiments was then carried out to extensively verify our findings by comparing the obtained ancestral reconstruction results with the latest released approach called MLGO (Maximum Likelihood for Gene-Order analysis)

    Ancestral Reconstruction and Investigations of Genomic Recombination on some Pentapetalae Chloroplasts

    No full text
    In this article, we propose a semi-automated method to rebuild genome ancestors of chloroplasts by taking into account gene duplication. Two methods have been used in order to achieve this work: a naked eye investigation using homemade scripts, whose results are considered as a basis of knowledge, and a dynamic programming based approach similar to Needleman-Wunsch. The latter fundamentally uses the Gestalt pattern matching method of sequence matcher to evaluate the occurrences probability of each gene in the last common ancestor of two given genomes. The two approaches have been applied on chloroplastic genomes from Apiales, Asterales, and Fabids orders, the latter belonging to Pentapetalae group. We found that Apiales species do not undergo indels, while they occur in the Asterales and Fabids orders. A series of experiments was then carried out to extensively verify our findings by comparing the obtained ancestral reconstruction results with the latest released approach called MLGO (Maximum Likelihood for Gene-Order analysis)

    Relation between Gene Content and Taxonomy in Chloroplasts

    No full text
    International audienceThe aim of this study is to investigate the relation that can be found between the phylogeny of a large set of complete chloroplast genomes, and the evolution of gene content inside these sequences. Core and pan genomes have been computed on de novo annotation of these 845 genomes, the former being used for producing well-supported phylogenetic tree while the latter provides information regarding the evolution of gene contents over time. It details too the specificity of some branches of the tree, when specificity is obtained on accessory genes. After having detailed the material and methods, we emphasize some remarkable relation between well-known events of the chloroplast history, like endosymbiosis, and the evolution of gene contents over the phylogenetic tree

    Comparison of metaheuristics to measure gene effects on phylogenetic supports and topologies

    No full text
    Abstract Background A huge and continuous increase in the number of completely sequenced chloroplast genomes, available for evolutionary and functional studies in plants, has been observed during the past years. Consequently, it appears possible to build large-scale phylogenetic trees of plant species. However, building such a tree that is well-supported can be a difficult task, even when a subset of close plant species is considered. Usually, the difficulty raises from a few core genes disturbing the phylogenetic information, due for example from problems of homoplasy. Fortunately, a reliable phylogenetic tree can be obtained once these problematic genes are identified and removed from the analysis.Therefore, in this paper we address the problem of finding the largest subset of core genomes which allows to build the best supported tree. Results As an exhaustive study of all core genes combination is untractable in practice, since the combinatorics of the situation made it computationally infeasible, we investigate three well-known metaheuristics to solve this optimization problem. More precisely, we design and compare distributed approaches using genetic algorithm, particle swarm optimization, and simulated annealing. The latter approach is a new contribution and therefore is described in details, whereas the two former ones have been already studied in previous works. They have been designed de novo in a new platform, and new experiments have been achieved on a larger set of chloroplasts, to compare together these three metaheuristics. Conclusions The ways genes affect both tree topology and supports are assessed using statistical tools like Lasso or dummy logistic regression, in an hybrid approach of the genetic algorithm. By doing so, we are able to provide the most supported trees based on the largest subsets of core genes
    corecore